🕸️ Ada Research Browser

security-guide.md
← Back

Security Architecture Guide

This document describes the security architecture of the Secure Runtime Environment (SRE) platform. It covers the defense-in-depth model, zero-trust network architecture, workload hardening, supply chain security, identity management, secrets handling, cryptographic controls, audit and monitoring, incident response procedures, and threat modeling.

All security controls are mapped to NIST 800-53 Rev 5 control identifiers. For the complete control-to-component crosswalk, see docs/agent-docs/compliance-mapping.md.


Table of Contents

  1. Security Architecture Overview
  2. Network Security
  3. Workload Security
  4. Supply Chain Security
  5. Identity and Access Management
  6. Secrets Management
  7. Cryptography
  8. Audit and Monitoring
  9. Incident Response
  10. Threat Model
  11. Security Exceptions
  12. References

1. Security Architecture Overview

1.1 Defense-in-Depth Model

SRE implements defense-in-depth through four reinforcing security layers. A compromise at one layer is contained by the controls at adjacent layers. No single component is a single point of failure for platform security.

+-----------------------------------------------------------------------+
|  Layer 1: Infrastructure Foundation                                    |
|  Rocky Linux 9 (DISA STIG) + FIPS 140-2 + SELinux Enforcing          |
|  RKE2 (CIS Benchmark + DISA STIG) + etcd encryption at rest          |
+-----------------------------------------------------------------------+
|  Layer 2: Network Security                                            |
|  Istio mTLS STRICT + NetworkPolicies (default deny)                   |
|  AuthorizationPolicies + NeuVector network segmentation               |
+-----------------------------------------------------------------------+
|  Layer 3: Workload Security                                           |
|  Kyverno admission control + Pod Security Contexts                    |
|  NeuVector runtime protection + process/file monitoring               |
+-----------------------------------------------------------------------+
|  Layer 4: Supply Chain Security                                       |
|  Harbor + Trivy scanning + Cosign signing + SBOM generation           |
|  Kyverno image verification + registry restriction                    |
+-----------------------------------------------------------------------+
|  Cross-Cutting: Identity, Secrets, Audit, Monitoring                  |
|  Keycloak SSO/MFA + OpenBao + Loki/Prometheus/Grafana                 |
+-----------------------------------------------------------------------+

Each layer operates independently so that:

1.2 Zero-Trust Architecture

SRE follows zero-trust principles as defined in NIST SP 800-207:

1.3 Compliance Alignment

The security architecture is designed to satisfy:

Framework Coverage
NIST 800-53 Rev 5 AC, AU, CA, CM, IA, IR, MP, RA, SA, SC, SI control families
CMMC 2.0 Level 2 All 110 practices (mapped from NIST 800-171 r2)
FedRAMP Moderate Inherited from NIST 800-53 coverage
DISA STIGs RKE2 Kubernetes STIG, RHEL 9 STIG (Rocky Linux 9), Istio STIG
CIS Benchmarks Kubernetes CIS 1.23 (RKE2 default), Rocky Linux 9 CIS Level 2

2. Network Security

Network security is enforced at three independent layers: the CNI (Kubernetes NetworkPolicies), the service mesh (Istio mTLS and AuthorizationPolicies), and the runtime security engine (NeuVector network segmentation). These layers are complementary, not redundant -- each operates at a different abstraction level and catches different classes of attack.

2.1 Istio Mutual TLS -- STRICT Mode

NIST Controls: SC-8 (Transmission Confidentiality and Integrity), SC-13 (Cryptographic Protection), IA-3 (Device Identification and Authentication), AC-14 (Permitted Actions Without Identification)

All pod-to-pod communication within the cluster is encrypted and mutually authenticated using Istio's STRICT mTLS mode. This is enforced by a cluster-wide PeerAuthentication resource deployed to istio-system:

# platform/core/istio/peer-authentication.yaml
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT

Key properties of STRICT mTLS enforcement:

2.2 Kubernetes NetworkPolicies

NIST Controls: AC-4 (Information Flow Enforcement), SC-7 (Boundary Protection)

Every namespace in the platform starts with a default-deny NetworkPolicy that blocks all ingress and egress traffic:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: <namespace>
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

Explicit allow rules are then added for each required communication path. This is enforced for all platform namespaces (istio-system, kyverno, monitoring, logging, neuvector, cert-manager, openbao, harbor) and all tenant namespaces.

Standard allow patterns for tenant namespaces:

Direction Source/Destination Purpose
Ingress istio-system Istio sidecar injection and gateway traffic
Ingress monitoring Prometheus metric scraping
Egress kube-system (port 53) DNS resolution
Egress istio-system Mesh control plane communication
Egress Service-specific targets Application dependencies (explicitly listed)

NetworkPolicies operate at the CNI layer (Layer 3/4) and are enforced independently of Istio. Even if the Istio sidecar is bypassed or misconfigured, the CNI-level deny rules remain active.

2.3 Istio AuthorizationPolicies

NIST Controls: AC-3 (Access Enforcement), AC-4 (Information Flow Enforcement), AC-6 (Least Privilege)

Istio AuthorizationPolicies provide Layer 7 access control for service-to-service communication. They operate on top of mTLS -- once a connection is authenticated, the AuthorizationPolicy determines whether the specific request (method, path, headers) is permitted.

The default posture is deny-all. Each namespace receives a deny-all AuthorizationPolicy, and explicit ALLOW rules are added for each required communication path:

# Deny all traffic by default
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: deny-all
  namespace: istio-system
spec:
  {}
# Explicitly allow specific traffic
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: allow-ingress-gateway
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: gateway
  action: ALLOW
  rules:
    - from:
        - source:
            ipBlocks:
              - "0.0.0.0/0"

AuthorizationPolicies support matching on:

This allows fine-grained rules such as "only the frontend service in the team-alpha namespace may call GET /api/v1/users on the user-service."

2.4 NeuVector Network Segmentation

NIST Controls: SC-7 (Boundary Protection), SI-4 (System Monitoring)

NeuVector provides an additional network security layer with Layer 7 deep packet inspection, DLP (Data Loss Prevention), and WAF (Web Application Firewall) capabilities.

Key features:

NeuVector network rules complement (not replace) Kubernetes NetworkPolicies and Istio AuthorizationPolicies. NetworkPolicies enforce at Layer 3/4, Istio at Layer 7 with identity, and NeuVector at Layer 7 with content inspection.

2.5 Boundary Protection

NIST Controls: SC-7 (Boundary Protection)

All external (north-south) traffic enters the cluster through a single controlled ingress point: the Istio ingress gateway.

# platform/core/istio/gateway.yaml
apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: main
  namespace: istio-system
spec:
  selector:
    istio: gateway
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE
        credentialName: sre-wildcard-tls
      hosts:
        - "*.apps.sre.example.com"
    - port:
        number: 80
        name: http
        protocol: HTTP
      tls:
        httpsRedirect: true
      hosts:
        - "*.apps.sre.example.com"

Properties of the ingress boundary:


3. Workload Security

3.1 Pod Security Contexts

NIST Controls: AC-6 (Least Privilege), CM-7 (Least Functionality), AC-6(10) (Prohibit Non-Privileged Users from Executing Privileged Functions)

Every workload deployed to SRE must run with a restricted security context. The following settings are mandatory and enforced by Kyverno admission control:

spec:
  template:
    spec:
      automountServiceAccountToken: false
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: app
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            runAsNonRoot: true
            capabilities:
              drop:
                - ALL
Setting Purpose NIST Control
runAsNonRoot: true Prevents containers from running as UID 0 AC-6
allowPrivilegeEscalation: false Blocks setuid/setgid binaries and no_new_privs AC-6(10)
readOnlyRootFilesystem: true Prevents runtime modification of the container filesystem CM-7
capabilities.drop: [ALL] Removes all Linux capabilities (NET_RAW, SYS_ADMIN, etc.) AC-6
seccompProfile.type: RuntimeDefault Applies the container runtime's default seccomp profile, blocking dangerous syscalls CM-7
automountServiceAccountToken: false Prevents automatic mounting of the ServiceAccount token, reducing credential exposure AC-6

3.2 Kyverno Policy Enforcement

NIST Controls: AC-3 (Access Enforcement), CM-6 (Configuration Settings), CM-7 (Least Functionality), SI-7 (Software Integrity)

Kyverno operates as a Kubernetes admission controller, intercepting every API request to validate and optionally mutate resources before they are persisted to etcd. Kyverno is deployed with 3 admission controller replicas for high availability.

The platform enforces three tiers of policies:

Baseline Policies (cluster-wide, applied to all namespaces):

Policy What It Blocks NIST Controls
disallow-privileged-containers securityContext.privileged: true AC-6(10), CM-7
disallow-host-namespaces hostNetwork, hostPID, hostIPC AC-6, CM-7
disallow-host-ports hostPort bindings SC-7
restrict-sysctls Unsafe sysctl settings CM-7

Restricted Policies (applied to tenant namespaces):

Policy What It Enforces NIST Controls
require-run-as-nonroot runAsNonRoot: true on all pods AC-6
require-drop-all-capabilities capabilities.drop: [ALL] on all containers AC-6, CM-7
restrict-volume-types Only configMap, emptyDir, secret, persistentVolumeClaim, projected CM-7
disallow-privilege-escalation allowPrivilegeEscalation: false AC-6(10)

Custom SRE Policies (platform-specific enforcement):

Policy What It Enforces NIST Controls
require-labels Standard labels (app.kubernetes.io/name, sre.io/team, etc.) CM-8
require-resource-limits CPU and memory requests/limits on all containers SC-5
restrict-image-registries Only harbor.sre.internal/* images permitted CM-11, SI-7
verify-image-signatures Cosign signature verification on all images SA-10, SI-7
disallow-default-namespace No workloads in the default namespace CM-7
disallow-latest-tag No :latest image tags CM-2
require-network-policies Every namespace must have a default-deny NetworkPolicy AC-4, SC-7
require-probes Liveness and readiness probes required SI-4

All policies run in Audit mode in development environments and Enforce mode in production. This is controlled via Flux value overlays. Policies in background: true mode continuously scan existing resources and generate PolicyReport CRDs, feeding compliance dashboards in Grafana.

Platform namespaces (kube-system, istio-system, flux-system, kyverno, cert-manager) are excluded from restricted and custom policies via exclude blocks to prevent self-denial scenarios.

3.3 NeuVector Runtime Protection

NIST Controls: SI-3 (Malicious Code Protection), SI-4 (System Monitoring), CM-7 (Least Functionality), IR-4 (Incident Handling)

NeuVector provides runtime security through three enforcement mechanisms:

Process monitoring. NeuVector maintains a process profile for every container. In Discover mode, it learns the expected process tree. When promoted to Monitor or Protect mode, any process not in the baseline triggers an alert (Monitor) or is terminated (Protect).

File system monitoring. NeuVector monitors file access patterns within containers. Unauthorized writes to sensitive paths (e.g., /etc/passwd, /usr/bin/, container entrypoint modifications) trigger alerts or blocking actions.

CIS benchmark scanning. NeuVector continuously evaluates running containers and the host node against CIS benchmarks, reporting deviations through its API and to the monitoring stack via Prometheus metrics.

NeuVector operates as a privileged DaemonSet (neuvector-enforcer) on every node. This is a documented security exception -- the enforcer requires privileged access to inspect container runtime state, network traffic, and host processes. See Section 11: Security Exceptions.

The NeuVector admission control webhook operates in the admission chain alongside Kyverno:

API Server --> Kyverno (mutate/validate) --> NeuVector admission --> Pod created --> Istio sidecar injected

NeuVector admission control can reject pods based on: - Image vulnerability severity thresholds (e.g., reject if CRITICAL CVEs present) - Compliance check failures - Process profile violations from previous runs


4. Supply Chain Security

4.1 Image Pipeline

NIST Controls: SA-10 (Developer Configuration Management), SI-7 (Software Integrity), RA-5 (Vulnerability Scanning), CM-11 (User-Installed Software)

The SRE image pipeline ensures that only trusted, scanned, signed, and catalogued images reach the cluster:

Developer Dockerfile
  |
  v
CI Pipeline (GitHub Actions / GitLab CI)
  |-- Build container image
  |-- Scan with Trivy (FAIL on CRITICAL/HIGH)
  |-- Generate SBOM with Syft (SPDX + CycloneDX)
  |-- Sign image with Cosign
  |-- Push to Harbor
  |-- Attach SBOM as OCI artifact
  v
Harbor (harbor.sre.internal)
  |-- Trivy scan on push (second scan gate)
  |-- Store Cosign signature
  |-- Store SBOM
  v
Deployment via Flux (GitOps image tag bump)
  |-- Kyverno: verify image is from harbor.sre.internal
  |-- Kyverno: verify Cosign signature
  |-- NeuVector: check vulnerability threshold
  v
Pod created with verified, scanned, signed image

4.2 Image Scanning with Harbor and Trivy

NIST Controls: RA-5 (Vulnerability Scanning), SI-2 (Flaw Remediation)

Harbor runs Trivy as its integrated vulnerability scanner. Every image pushed to Harbor is automatically scanned. The scanning pipeline enforces:

4.3 Image Signing with Cosign

NIST Controls: SI-7 (Software Integrity), SA-10 (Developer Configuration Management)

All images are signed with Cosign during the CI pipeline using a private key stored in OpenBao. The corresponding public key is embedded in the Kyverno verify-image-signatures ClusterPolicy:

# policies/custom/verify-image-signatures.yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
  annotations:
    sre.io/nist-controls: "SA-10, SI-7"
spec:
  validationFailureAction: Audit  # Enforce in production
  webhookTimeoutSeconds: 30
  rules:
    - name: verify-cosign-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        - imageReferences:
            - "harbor.sre.internal/*"
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      REPLACE_ME_WITH_COSIGN_PUBLIC_KEY
                      -----END PUBLIC KEY-----

This policy verifies that:

In production, validationFailureAction is set to Enforce, causing any unsigned or incorrectly signed image to be rejected at admission.

4.4 Registry Restriction

NIST Controls: CM-11 (User-Installed Software), SI-7 (Software Integrity)

The restrict-image-registries Kyverno ClusterPolicy ensures that only images from harbor.sre.internal/* are permitted in the cluster. This applies to both containers and initContainers:

# policies/custom/restrict-image-registries.yaml
spec:
  rules:
    - name: validate-image-registry
      validate:
        message: >-
          All container images must be from the approved registry
          harbor.sre.internal/. Push your image to Harbor first.
        pattern:
          spec:
            containers:
              - image: "harbor.sre.internal/*"

This ensures:

4.5 Software Bill of Materials (SBOM)

NIST Controls: CM-8 (Information System Component Inventory), SA-11 (Developer Testing)

SBOMs are generated during the CI pipeline using Syft in two formats:

SBOMs are attached to images as OCI artifacts in Harbor and signed with Cosign attestation. This enables:


5. Identity and Access Management

5.1 Keycloak SSO with Multi-Factor Authentication

NIST Controls: IA-2 (Identification and Authentication), AC-2 (Account Management), AC-17 (Remote Access)

Keycloak is the centralized identity provider for all platform management interfaces. All UIs require authentication through Keycloak before access is granted:

Service Authentication Method
Grafana OIDC via Keycloak
Harbor OIDC via Keycloak
NeuVector Manager OIDC via Keycloak
Kiali OIDC via Keycloak
Backstage (optional) OIDC via Keycloak
Kubernetes API OIDC via Keycloak (or client certificate)

Keycloak enforces:

5.2 Kubernetes RBAC

NIST Controls: AC-3 (Access Enforcement), AC-6 (Least Privilege), AC-6(1) (Authorize Access to Security Functions)

Kubernetes RBAC is configured with the principle of least privilege:

Platform administrators have access to platform namespaces via ClusterRoles mapped to the platform-admins Keycloak group. Even administrators do not have unrestricted cluster-admin access -- they receive only the permissions needed for platform operations.

Tenant teams receive namespace-scoped Roles bound to their Keycloak group. A team member in the team-alpha Keycloak group receives permissions only within the team-alpha namespace.

Flux service account is the only identity with write access to platform-managed namespaces. Direct kubectl apply against platform resources is restricted by Kyverno policies that validate the requesting identity.

ServiceAccount isolation. Each workload gets its own ServiceAccount with automountServiceAccountToken: false by default. Tokens are mounted explicitly only when Kubernetes API access is required.

5.3 Istio Service Identity

NIST Controls: IA-3 (Device Identification and Authentication), AC-14 (Permitted Actions Without Identification)

Every workload in the mesh receives a SPIFFE identity:

spiffe://sre.example.com/ns/<namespace>/sa/<service-account>

This identity is:


6. Secrets Management

6.1 OpenBao Architecture

NIST Controls: IA-5 (Authenticator Management), SC-12 (Cryptographic Key Establishment), SC-28 (Protection of Information at Rest)

OpenBao (the open-source fork of HashiCorp Vault, fully API-compatible) is deployed in HA mode with Raft storage for secrets management.

Encryption at rest. All data stored in OpenBao's Raft backend is encrypted using AES-256-GCM. The master encryption key is itself encrypted by the unseal key.

Auto-unseal. In production, OpenBao uses cloud KMS (AWS KMS or Azure Key Vault) for automatic unsealing, eliminating the need for manual unseal operations after a restart. In development, a transit-based unseal mechanism is used.

Secrets engines configured:

Engine Path Purpose
KV v2 sre/ Static secrets for applications (database URLs, API keys)
PKI pki/ Internal CA for service certificates
Database database/ Dynamic database credentials with automatic rotation
Transit transit/ Encryption-as-a-service for applications

Audit logging. Every access to OpenBao is logged. Audit logs are forwarded to Loki via Alloy for centralized retention and analysis. This satisfies NIST AU-2 (Audit Events) for secrets access.

6.2 External Secrets Operator (ESO) Sync Pattern

NIST Controls: SC-28 (Protection of Information at Rest), CM-3 (Configuration Change Control)

The External Secrets Operator bridges OpenBao and Kubernetes. Instead of storing secrets in Git or passing them through the CI pipeline, ESO pulls secrets from OpenBao and creates native Kubernetes Secrets:

OpenBao (source of truth)
  |
  v
ExternalSecret CRD (in Git, no secret values)
  |
  v
ESO controller (syncs on interval)
  |
  v
Kubernetes Secret (native, consumed by pods)

Key properties:

6.3 Secret Rotation

NIST Controls: IA-5 (Authenticator Management), SC-12 (Cryptographic Key Establishment)

Secret rotation is enforced at multiple levels:


7. Cryptography

7.1 FIPS 140-2 Mode

NIST Controls: SC-13 (Cryptographic Protection)

FIPS 140-2 compliance is enforced at two levels:

Operating system. Rocky Linux 9 is configured with the FIPS crypto policy via Ansible:

# Applied during OS hardening
fips-mode-setup --enable
update-crypto-policies --set FIPS

This restricts the OS to using only FIPS 140-2 validated cryptographic modules for all operations, including SSH, TLS, and disk encryption.

Kubernetes. RKE2 is compiled with the GoBoring Go compiler, which links against BoringCrypto -- a FIPS 140-2 validated cryptographic module. All Kubernetes control plane communications (API server, etcd, scheduler, controller-manager) use FIPS-validated TLS.

The combination ensures that all cryptographic operations -- from SSH access to the node, to Kubernetes API calls, to etcd data encryption, to Istio mTLS -- use FIPS 140-2 validated algorithms.

7.2 TLS Everywhere

NIST Controls: SC-8 (Transmission Confidentiality and Integrity), SC-12 (Cryptographic Key Establishment)

TLS is used for every communication path:

Path TLS Method Certificate Source
Client to Istio gateway TLS 1.2+ (SIMPLE mode) cert-manager (Let's Encrypt or internal CA)
Gateway to backend pods Istio mTLS Istio Citadel (automatic)
Pod to pod (in mesh) Istio mTLS (STRICT) Istio Citadel (automatic)
Kubernetes API server TLS 1.2+ RKE2 internal CA
etcd cluster TLS 1.2+ RKE2 internal CA
OpenBao API TLS 1.2+ cert-manager
Harbor registry TLS 1.2+ cert-manager
Grafana, Kiali, NeuVector UIs TLS 1.2+ via Istio gateway cert-manager

cert-manager automates certificate lifecycle management:

7.3 Encryption at Rest

NIST Controls: SC-28 (Protection of Information at Rest), MP-2 (Media Access)

Data at rest is encrypted at multiple levels:


8. Audit and Monitoring

8.1 Kubernetes API Audit Logging

NIST Controls: AU-2 (Audit Events), AU-3 (Content of Audit Records), AU-12 (Audit Generation)

RKE2 is configured with a Kubernetes audit policy that captures:

Audit logs are written in structured JSON format with the following fields (per NIST AU-3 requirements):

Audit logs are collected by Alloy and forwarded to Loki for centralized storage and querying.

8.2 Istio Access Logs

NIST Controls: AU-2 (Audit Events), AU-3 (Content of Audit Records)

Istio is configured to emit JSON-formatted access logs for every request passing through the mesh:

# From platform/core/istio/helmrelease-istiod.yaml
meshConfig:
  accessLogFile: "/dev/stdout"
  accessLogEncoding: "JSON"

Every Istio sidecar proxy logs:

These logs are collected by Alloy alongside application logs and stored in Loki with the same retention policies.

8.3 Centralized Logging with Loki

NIST Controls: AU-4 (Audit Storage Capacity), AU-6 (Audit Review and Reporting), AU-9 (Protection of Audit Information)

Grafana Loki is the centralized log aggregation system. Alloy (Grafana's collector, replacing Promtail) runs as a DaemonSet on every node and collects:

Retention policies:

Log Type Retention Period Rationale
Application logs 30 days Operational troubleshooting
Audit logs 90 days minimum Compliance requirement (NIST AU-4)
Security events 90 days minimum Incident investigation window

Log storage is encrypted at rest (see Section 7.3) and access is restricted via Grafana RBAC integrated with Keycloak. Only members of the audit-team Keycloak group have access to audit log datasources.

8.4 NeuVector Security Events

NIST Controls: SI-4 (System Monitoring), IR-5 (Incident Monitoring)

NeuVector generates security events for:

NeuVector events are exported via SYSLOG to Alloy, which forwards them to Loki. Prometheus metrics from NeuVector's exporter (port 8068) feed Grafana dashboards and AlertManager rules.

8.5 Kyverno PolicyReport Violations

NIST Controls: CA-7 (Continuous Monitoring), SI-6 (Security Function Verification)

Kyverno generates PolicyReport and ClusterPolicyReport CRDs that record:

Kyverno Reporter exports these results as Prometheus metrics, enabling:

8.6 Prometheus and Grafana

NIST Controls: SI-4 (System Monitoring), SI-5 (Security Alerts), CA-7 (Continuous Monitoring)

The kube-prometheus-stack provides real-time monitoring with pre-built dashboards for:

AlertManager is configured with escalation rules for security-relevant events:

Alert Severity Destination
Kyverno policy violation (critical) Critical PagerDuty + Slack
NeuVector process profile violation High Slack + Email
Certificate expiring within 7 days Warning Slack
Flux reconciliation failure Warning Slack
Pod running as root detected Critical PagerDuty + Slack
Image from unapproved registry Critical PagerDuty + Slack
Loki ingestion failure High Slack + Email

9. Incident Response

9.1 Detection

NIST Controls: IR-4 (Incident Handling), IR-5 (Incident Monitoring)

Security incidents are detected through multiple independent channels:

Detection Source Event Types Response Time
NeuVector alerts Runtime anomalies, process violations, DLP triggers, network violations Real-time (seconds)
Kyverno PolicyReports Policy violations on new or existing resources Near real-time (admission) or periodic (background scan)
Prometheus AlertManager Metric-based anomalies (CPU spikes, unusual traffic patterns, certificate issues) Configurable (default 5m evaluation)
Kubernetes audit logs Unauthorized API access attempts, privilege escalation, RBAC violations Near real-time via Loki alerting rules
Harbor Trivy scans New CVEs in deployed images (rescan) Periodic (configurable scan interval)

9.2 Response Procedures

When a security incident is detected, follow this procedure:

Phase 1: Triage (0-15 minutes)

  1. Acknowledge the alert in the alerting system (PagerDuty/Slack).
  2. Identify the affected namespace, pod, node, and service using Grafana dashboards.
  3. Determine the severity level based on the classification table below.
Severity Criteria Response SLA
Critical Active data exfiltration, container escape, cluster-level compromise Immediate (24/7 on-call)
High Unauthorized process execution, policy violation in production, credential exposure 1 hour
Medium Failed admission attempts, anomalous traffic patterns, non-critical CVE in running image 4 hours
Low Background policy report failures, development environment violations Next business day

Phase 2: Containment (15-60 minutes)

  1. Isolate the workload. Apply a NetworkPolicy that blocks all ingress and egress for the affected pod or namespace: ```bash kubectl apply -f - <<EOF apiVersion: networking.k8s.io/v1 kind: NetworkPolicy metadata: name: incident-isolation namespace: spec: podSelector: matchLabels: app: policyTypes:
    • Ingress
    • Egress EOF ```
  2. Preserve evidence. Capture pod logs, describe output, and NeuVector forensic data before any remediation: bash kubectl logs <pod> -n <namespace> --all-containers > incident-logs.txt kubectl describe pod <pod> -n <namespace> > incident-describe.txt kubectl get events -n <namespace> --sort-by='.lastTimestamp' > incident-events.txt
  3. Scale down if necessary. If the compromise is active, scale the affected Deployment to zero replicas to stop the workload while preserving the Deployment resource for investigation.
  4. Revoke credentials. If credential theft is suspected, rotate the affected secrets in OpenBao immediately. ESO will sync new values to Kubernetes.

Phase 3: Investigation (1-24 hours)

  1. Query Loki for all logs from the affected workload during the incident window.
  2. Query Istio access logs for network activity to/from the affected workload.
  3. Review NeuVector's network visualization for unexpected connections.
  4. Check Kyverno PolicyReports for any related policy violations.
  5. Review Kubernetes audit logs for API actions taken by the workload's ServiceAccount.
  6. Check Harbor for the image's Trivy scan results and SBOM to identify potential vulnerability exploits.

Phase 4: Remediation (24-72 hours)

  1. Patch the vulnerability or misconfiguration that caused the incident.
  2. Rebuild and rescan affected container images.
  3. Update Kyverno policies if the incident revealed a gap in admission control.
  4. Update NeuVector process profiles if legitimate behavior was incorrectly baselined.
  5. Deploy fixes via the standard GitOps workflow (PR, review, merge, Flux reconcile).

Phase 5: Post-Incident Review (within 1 week)

  1. Conduct a blameless post-incident review with all involved parties.
  2. Document the incident timeline, root cause, and remediation actions.
  3. Identify improvements to detection, containment, and response procedures.
  4. Update runbooks, alert rules, and policies based on lessons learned.
  5. File the incident report in the compliance record for auditors (NIST IR-6).

9.3 Disaster Recovery

NIST Controls: CP-9 (System Backup), CP-10 (System Recovery)

Velero provides cluster backup and restore capabilities:

For full cluster recovery, the procedure is:

  1. Provision new infrastructure using OpenTofu.
  2. Harden the OS using Ansible.
  3. Install RKE2.
  4. Bootstrap Flux CD (which reconciles the entire platform from Git).
  5. Restore application data from Velero backups.

Because the entire platform configuration is stored in Git, the cluster can be rebuilt from scratch in under 2 hours (infrastructure provisioning time is the bottleneck).


10. Threat Model

This section identifies key threats to the SRE platform and the specific mitigations in place for each.

10.1 Supply Chain Attacks

Threat: An attacker compromises an upstream container image, base image, or dependency to inject malicious code into the platform.

Mitigations:

Layer Control Component
Image source Only images from harbor.sre.internal are permitted Kyverno restrict-image-registries policy (CM-11)
Image integrity All images must be signed with Cosign Kyverno verify-image-signatures policy (SI-7)
Vulnerability scanning Trivy scans on push and periodic rescan Harbor + Trivy (RA-5)
Dependency tracking SBOM generated for every image Syft + Harbor OCI artifacts (CM-8)
Base image minimization Chainguard, distroless, or Alpine base images CI pipeline templates (CM-7)
Runtime detection Process and file system monitoring NeuVector enforcer (SI-3)

Residual risk: If a zero-day vulnerability exists in a dependency and has no CVE entry, Trivy will not detect it. NeuVector's behavioral monitoring is the last line of defense in this scenario.

10.2 Lateral Movement

Threat: An attacker who compromises one workload moves laterally to other services, escalating access within the cluster.

Mitigations:

Layer Control Component
Network isolation Default-deny NetworkPolicies in every namespace CNI enforcement (AC-4, SC-7)
Service identity mTLS STRICT with SPIFFE identity verification Istio PeerAuthentication (SC-8, IA-3)
Service authorization Deny-all AuthorizationPolicies with explicit allows Istio AuthorizationPolicy (AC-3)
Egress control REGISTRY_ONLY outbound traffic policy Istio meshConfig (SC-7)
Runtime segmentation Layer 7 network microsegmentation NeuVector (SC-7)
Credential isolation Per-workload ServiceAccount, no auto-mounted tokens Kubernetes RBAC (AC-6)

Residual risk: Lateral movement within a single namespace between pods of the same application is more difficult to detect. NeuVector behavioral baselines and Istio telemetry provide visibility here.

10.3 Data Exfiltration

Threat: An attacker extracts sensitive data from the cluster to an external destination.

Mitigations:

Layer Control Component
Egress control Default-deny egress NetworkPolicies CNI enforcement (AC-4)
Egress control REGISTRY_ONLY -- only registered external services permitted Istio outbound policy (SC-7)
DLP Deep packet inspection for sensitive data patterns NeuVector DLP sensors (SI-4)
DNS control Only kube-system DNS is reachable from pods NetworkPolicy egress rules
Audit trail All egress connections logged Istio access logs (AU-2)

Residual risk: Exfiltration through permitted egress channels (e.g., encoding data in legitimate HTTPS requests to an allowed service) is difficult to detect at the platform level. NeuVector's DLP sensors provide partial coverage, but application-level controls may be needed for high-sensitivity data.

10.4 Privilege Escalation

Threat: An attacker escalates from a non-root container process to root, or from container to node, to gain higher privileges.

Mitigations:

Layer Control Component
Container runtime runAsNonRoot: true, allowPrivilegeEscalation: false Kyverno require-security-context policy (AC-6)
Capabilities capabilities.drop: [ALL] on all containers Kyverno policy (CM-7)
Seccomp RuntimeDefault seccomp profile (blocks dangerous syscalls) Pod security context
Filesystem readOnlyRootFilesystem: true Pod security context (CM-7)
OS hardening SELinux enforcing mode, STIG-hardened kernel Ansible os-hardening role (CM-6)
Admission control No privileged containers, no host namespaces, no host path mounts Kyverno baseline policies (AC-6(10))
Runtime monitoring Unexpected process execution detected and blocked NeuVector enforcer (SI-3)

Residual risk: Kernel-level vulnerabilities (container escape via kernel exploit) are mitigated by SELinux and seccomp but cannot be fully prevented at the application layer. Timely kernel patching via the OS hardening pipeline is essential.

10.5 Credential Theft

Threat: An attacker steals credentials (ServiceAccount tokens, database passwords, API keys) to impersonate legitimate services or access protected resources.

Mitigations:

Layer Control Component
Token exposure automountServiceAccountToken: false by default Pod security context (AC-6)
Dynamic secrets Short-lived database credentials (1h TTL) OpenBao database engine (IA-5)
Secret rotation Automatic rotation and revocation OpenBao + ESO (SC-12)
Encryption at rest Kubernetes Secrets encrypted in etcd RKE2 default encryption (SC-28)
Secret access audit All OpenBao access logged OpenBao audit backend to Loki (AU-2)
mTLS identity Service identity tied to X.509 certificates, not bearer tokens Istio Citadel (IA-3)
Git hygiene No secrets stored in Git -- only ExternalSecret references ESO pattern (CM-3)

Residual risk: If an attacker gains access to the Kubernetes API with sufficient RBAC permissions, they can read Kubernetes Secrets directly. This is mitigated by tight RBAC scoping and Kubernetes API audit logging.

10.6 Denial of Service

Threat: An attacker overwhelms platform services or tenant workloads with excessive traffic or resource consumption.

Mitigations:

Layer Control Component
Resource limits CPU and memory limits enforced on all pods Kyverno require-resource-limits policy
Namespace quotas ResourceQuota and LimitRange per tenant namespace Tenant onboarding templates
Autoscaling HPA on workloads with configurable min/max replicas Helm chart templates
Rate limiting Istio rate limiting at the gateway Istio EnvoyFilter
PDB PodDisruptionBudget prevents total pod eviction Helm chart templates
Monitoring Resource usage alerts and capacity planning dashboards Prometheus + Grafana (SI-4)

11. Security Exceptions

Certain platform components require elevated privileges to function. These are documented exceptions with compensating controls.

NeuVector Enforcer DaemonSet

Exception: The NeuVector enforcer runs as a privileged DaemonSet with host network, host PID, and host filesystem access.

Justification: The enforcer must inspect container runtime state, monitor network traffic at the host level, and access the container runtime socket to perform runtime security scanning and enforcement.

Compensating controls: - NeuVector runs in a dedicated neuvector namespace with its own RBAC. - NetworkPolicies restrict NeuVector's network access to only required ports (controller cluster ports, Kubernetes API, DNS, metrics exporter). - The NeuVector namespace is excluded from tenant Kyverno policies but covered by platform-level monitoring. - All NeuVector actions are logged and forwarded to Loki.

Velero Node Agent

Exception: The Velero node agent requires privileged access to read persistent volume data on nodes.

Justification: File-level backups of persistent volumes require host filesystem access.

Compensating controls: - Velero runs in a dedicated velero namespace with restricted RBAC. - NetworkPolicies limit Velero's network access to the S3 backup destination and the Kubernetes API. - Backup operations are logged and monitored.

Istio Init Containers

Exception: Istio's istio-init init container runs with NET_ADMIN and NET_RAW capabilities to configure iptables rules for traffic interception.

Justification: The init container must modify the pod's network namespace to redirect traffic through the sidecar proxy.

Compensating controls: - The init container runs for a brief period and exits before the main application starts. - The main sidecar container runs without elevated privileges. - The init container image is from the pinned Istio release and verified by Harbor's scan pipeline.


12. References

Document Location
Architecture Overview docs/architecture.md
Architecture Decision Records docs/decisions.md
Compliance Mapping (NIST 800-53) docs/agent-docs/compliance-mapping.md
Kyverno Policy Patterns docs/agent-docs/kyverno-patterns.md
Flux CD Patterns docs/agent-docs/flux-patterns.md
Helm Chart Conventions docs/agent-docs/helm-conventions.md
Ansible Patterns docs/agent-docs/ansible-patterns.md
OpenTofu Patterns docs/agent-docs/tofu-patterns.md

External Standards

Standard Reference
NIST SP 800-53 Rev 5 https://csrc.nist.gov/publications/detail/sp/800-53/rev-5/final
NIST SP 800-207 (Zero Trust) https://csrc.nist.gov/publications/detail/sp/800-207/final
NIST SP 800-63B (Authentication) https://pages.nist.gov/800-63-3/sp800-63b.html
DISA RKE2 STIG https://www.stigviewer.com/stig/rancher_government_solutions_rke2/
DISA RHEL 9 STIG https://www.stigviewer.com/stig/red_hat_enterprise_linux_9/
CIS Kubernetes Benchmark https://www.cisecurity.org/benchmark/kubernetes
CMMC 2.0 https://dodcio.defense.gov/CMMC/
SPIFFE/SPIRE https://spiffe.io/